Predicting preterm births from electrohysterogram recordings via deep learning

About one in ten babies is born preterm, i.e., before completing 37 weeks of gestation, which can result in permanent neurologic deficit and is a leading cause of child mortality. Although imminent preterm labor can be detected, predicting preterm births more than one week in advance remains elusive. Here, we develop a deep learning method to predict preterm births directly from electrohysterogram (EHG) measurements of pregnant mothers recorded at around 31 weeks of gestation. We developed a prediction model, which includes a recurrent neural network, to predict preterm births using short-time Fourier transforms of EHG recordings and clinical information from two public datasets. We predicted preterm births with an area under the receiver-operating characteristic curve (AUC) of 0.78 (95% confidence interval: 0.76-0.80). Moreover, we found that the spectral patterns of the measurements were more predictive than the temporal patterns, suggesting that preterm births can be predicted from short EHG recordings in an automated process. We show that preterm births can be predicted for pregnant mothers around their 31st week of gestation, prompting beneficial treatments to reduce the incidence of preterm births and improve their outcomes.

Introduction
higher risk of preterm birth, but these risk factors alone are not sufficient to accurately predict which individual mothers will deliver preterm [6][7][8].
In clinical practice, preterm birth is usually predicted by measuring cervical length or the concentration of cervico-vaginal fibronectin alpha [8]. In mothers with symptoms of preterm labor, these minimally invasive tests can predict births that will occur within one week [9,10]. Moreover, the combination of these tests has been reported to produce more accurate results than each method separately and could be used to predict preterm births in symptomatic mothers within two weeks of testing [11]. These measurements are helpful because they inform physicians and guide treatments to reduce the risk of preterm labor and to improve its outcomes. However, these measurements are not cost-effective screening tools for the general population of pregnant mothers because they have low predictive values among mothers at low risk for preterm labor, such as nulliparous women with singleton pregnancies [8,12].
Home uterine activity monitors (HUAMs) were developed to measure uterine contractions and predict preterm births. The first such devices were based on tocodynamometer recordings, which measure the pressure changes associated with uterine contractions [13]. Unfortunately, these devices could not predict preterm births, and current clinical guidelines discourage their use for this purpose [8,13,14].
More recently, electrohysterogram (EHG) recordings have been proposed to predict preterm births [15,16]. EHG recordings use abdominal electrodes to measure the electrical activity associated with uterine contractions, and they can be recorded with portable devices equipped with algorithms to monitor uterine contractions [15,17]. A variety of algorithms have been developed with the aim of predicting preterm births from various features derived from EHG measurements [16,18]. These features are generally calculated from uterine contraction intervals, either manually selected or identified using dedicated algorithms [16,19,20]. These intervals can also be identified with the aid of simultaneous tocodynamometer recordings [17,19].
To the best of our knowledge, EHG measurements have not yet been shown to predict preterm births more than two weeks in advance with a performance comparable to the clinical standards, i.e., using measurements of cervical length or fibronectin alpha for detecting imminent labor. Although many researchers have reported nearly perfect predictions of preterm births based on EHG measurements from the "Term-Preterm EHG Database," meticulous analysis revealed that these results were overoptimistic and resulted from data leakage [21,22]. Namely, these works inadvertently introduced strong correlations between the data used to train the prediction models and the data used to test the performance of these models, as shown by Vandewiele et al. [16,21,22]. This problem was caused by inappropriate attempts to improve the models' performance by balancing the number of term and preterm samples used to develop these models. After Vandewiele et al. corrected this problem, these models were no longer able to predict preterm births accurately [21,22]. Additional works with sound methodology suggest that some features derived from EHG measurements can be used to distinguish between recordings of mothers who eventually delivered at term from those who delivered preterm [23][24][25][26]. However, none of these works could predict preterm births with clinically useful accuracy. More recently, Xu et al. and Lou et al. developed methods for predicting preterm births avoiding data leakage [27,28]. Although Xu et al. and Lou et al. achieved high classification performances on test sets including real and synthetic measurements, the performances of their approaches on test sets including only real measurements are not reported. Moreover, Fischer et al. used an end-to-end deep learning model to predict preterm births from EHG measurements without artificially increasing the number of preterm samples to avoid possible data leakage and achieved only a moderate accuracy [29].
Here, we present an end-to-end deep learning model that predicts preterm births directly from EHG measurements, without handcrafted features. Therefore, our model is not sensitive to varying implementations of specific features or to how uterine contractions are segmented. We developed our work using EHG measurements and supplementary clinical information from two public databases. Importantly, we developed our model with care to avoid data leakage. Using our model, we could predict preterm births in pregnant mothers around their 31st week of gestation. Our predictive accuracy was close to that achieved by using cervical length and fibronectin alpha measurements to predict preterm labors in mothers with symptoms of preterm labor and within one week of delivery. Moreover, by investigating the measurement components that contribute to the predictions of our model, we showed that it is possible to predict preterm births using short recording times, thus facilitating clinical adoption and at-home implementation of EHG measurements. This finding is aligned with the observations of Jager et al., who proposed that preterm births can be predicted from short contractile or noncontractile intervals of EHG measurements with similar accuracy as when using 30-minute long recordings [19]. Our work and results encourage using EHG measurements and deep learning for predicting preterm births in real-world scenarios. Their successful employment could help reduce newborn morbidity and mortality, especially in populations with limited access to healthcare, who suffer more from preterm birth [2].

Study participants
In developing our work, we used two datasets in the Physionet repository, aggregating data from the "Term-Preterm EHG Database" (TPEHG DB) [23,30] and from the "Term-Preterm ElectroHysteroGram DataSet with Tocogram" (TPEHGT DS) [19,30]. These datasets contain bipolar EHG measurements, with nearly every recording lasting 30 minutes, and clinical information obtained from pregnant mothers during regular pregnancy checkups, as well as from mothers hospitalized for threatened preterm labor. Both datasets were acquired at the University Medical Centre Ljubljana, using the same recording protocol and device. The TPEHG DB consists of 300 records, each obtained from a different mother at either around the 22nd or the 32nd week of gestation. Additionally, the TPEHGT DS contains 26 records from 18 different mothers, obtained around the 31st week of gestation. Half of the samples in the TPEHGT DS correspond to mothers who eventually delivered preterm, while the other 13 records correspond to term deliveries. When compiling these datasets, the datasets' authors excluded the mothers whose labors were induced or whose deliveries were performed using a Cesarean section [19,23].
We included the records from both datasets obtained after the 26th week of gestation. We used this threshold of 26 weeks following the grouping of gestational ages at the time of the recordings in the TPEHG DB [23]. Since each record in the TPEHG DB was obtained from a different mother, we included all the records from this database that were obtained after the 26th week of gestation. On the other hand, when there were multiple records for the same mother in the TPEHGT DS, we included only the latest record during the pregnancy, provided that the record was made after the 26th week of gestation. We identified the records in the TPEHGT DS that corresponded to a particular mother by comparing the clinical information. By using a single record per mother, we prevented our models from learning features that characterize mothers rather than features that are predictive of pregnancy outcomes. Overall, we used 159 records from different mothers. Each of these records lasted 30 minutes, except for two records that were 26 and 33 minutes long. To facilitate the data analysis, we zero-padded the 26-minute-long record and truncated the 33-minute-long record, so that all the records were 30 minutes long. Among these mothers, 18.9% delivered preterm. We detail the clinical information of these mothers in Table 1. Additionally, we illustrate the distribution of gestational ages of the mothers included at the times of recording and at birth in S1 and S2 Figs.
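As a rough illustration of the record selection and length equalization described above, the following Python sketch keeps the latest eligible record per mother and forces every signal to 30 minutes. The dictionary keys (`mother_id`, `week`, `signal`), the function names, the strict "later than week 26" comparison, and the choice to append zeros at the end of short records are assumptions made for this example; the study itself was implemented in MATLAB.

```python
import numpy as np


def latest_record_per_mother(records, min_week=26.0):
    """Keep, for each mother, the latest record made after `min_week`.

    `records` is a list of dicts with the hypothetical keys
    'mother_id', 'week' (gestational age at recording) and 'signal'.
    """
    latest = {}
    for rec in records:
        if rec["week"] <= min_week:
            continue  # recorded too early in the pregnancy
        best = latest.get(rec["mother_id"])
        if best is None or rec["week"] > best["week"]:
            latest[rec["mother_id"]] = rec
    return list(latest.values())


def fix_length(signal, fs, minutes=30):
    """Truncate, or zero-pad at the end, so the signal lasts `minutes`."""
    target = int(round(minutes * 60 * fs))
    signal = np.asarray(signal, dtype=float)
    if signal.size >= target:
        return signal[:target]
    return np.pad(signal, (0, target - signal.size))
```

Keeping a single record per mother means that no mother can appear in both the training and the testing partitions.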

Prediction models
We developed classification and regression models to predict a term or a preterm birth. The classification models were trained specifically to predict categorical outcomes, i.e., delivery at term or preterm. Pursuing a different approach, we trained the regression models to predict the gestational age at delivery, labeling predictions lower than 37 completed weeks, or 259 days, as preterm, and those of at least 37 completed weeks as term. In developing the classification and regression models, we used clinical information alone, EHG measurements alone, and clinical information combined with EHG measurements. These prediction models, developed using MATLAB 2020a, are detailed in the next subsections and summarized in a block diagram in Fig 1.
Clinical information models. First, using only the clinical information of the records, we predicted whether each mother delivered preterm or at term. We used most of the predictors shown in Table 1, namely maternal age, gestational age at the time of the recording, weight, whether the mothers had given birth previously (parous), had aborted pregnancies previously, had reported vaginal bleeding in the first trimester, had reported vaginal bleeding in the second trimester, or were smokers. We excluded diagnoses of hypertension and diabetes because these diagnoses are mostly absent in this dataset. We also excluded diagnoses of funnelling because they are made through transvaginal sonography and because, in this dataset, these diagnoses have a low predictive power [8]. Similar to funnelling, we excluded the variable in the dataset indicating the placental position, which takes the values "front" (considered as the positive value in Table 1) and "end." We completed the missing entries for each variable in the training and testing datasets using the mode of that variable in the training set.
To prevent data leakage, rather than using the modes of the entire dataset, we used the modes of the samples in the training set to complete missing entries in both the training and testing sets [31]. Therefore, our training data does not contain any information from the test set and when making predictions, our model uses only information from the training set to both complete missing entries and make predictions. In other words, our model makes predictions on each sample of the testing dataset using only information from the training set. Next, we trained a logistic regression to predict whether deliveries were preterm and a linear regression model to predict the gestational age at birth. These models are represented using a block with a blue outline on the upper part of Fig 1. In the logistic regression model, we discarded the redundant predictors, using lasso regularization. We regularized only the classification model, and not the regression model, because we observed that the lasso regularization improved the performance of the logistic regression model slightly but marginally worsened the performance of the linear regression model. Since we regularized the logistic regression model, we also normalized the predictors in this model to prevent the regularization term from penalizing the model parameters based on the scale of the predictors. Again, to prevent data leakage, we normalized both the training and testing sets using the means and standard deviations of the samples in the training set, thus avoiding revealing information from the test set to the training set.
Fig 1. The clinical information model is illustrated in the upper part of the diagram, using shapes with blue outlines. This model uses clinical information, in tabular format, to predict preterm births by using logistic or linear regression, represented as a block with a blue outline and schematic illustrations below it. Preprocessing the clinical information consists of completing missing entries and normalizing the predictors, as described in Methods. The EHG model is illustrated in the lower part of the diagram, using shapes with black outlines. This model uses EHG measurements, represented by an input block with a schematic illustration below it, that are first preprocessed. This preprocessing step includes bandpass filtering (BPF) and downsampling. The preprocessed measurements are used to compute STFTs, illustrated by a block and a schematic representation, that are used as input to the RNN. This network is composed of an input layer, a BiLSTM layer, a fully connected (FC) layer, and an output layer, which are illustrated using light blue shapes with black outlines and enclosed within a dashed light blue outline. The combined model uses clinical information and EHG measurements to predict preterm births and is illustrated in the middle part of the diagram using shapes with red outlines. The dotted black outline represents the cross-validation technique employed, indicating that the operations within are applied separately for each data partition, whereas the operations outside are applied to all the data, independent of the data partition.
The operation to complete missing entries described above, together with the operation to normalize the input data, comprise the preprocessing step for the clinical information. This preprocessing step is represented in Fig 1 as a block that is executed once for each partitioning of the data into training and testing datasets, as described below.
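The leakage-free preprocessing of the clinical information can be sketched as two steps: learn the statistics on the training partition only, then apply them unchanged to both partitions. The Python example below is a minimal illustration under stated assumptions: missing entries are encoded as NaN, every column has at least one observed training value, and the function names are hypothetical.

```python
import numpy as np


def fit_preprocessing(X_train):
    """Learn per-column modes, means and standard deviations from the
    training set only (missing entries are encoded as NaN)."""
    modes = []
    for col in X_train.T:
        values, counts = np.unique(col[~np.isnan(col)], return_counts=True)
        modes.append(values[np.argmax(counts)])
    modes = np.array(modes)
    filled = np.where(np.isnan(X_train), modes, X_train)
    mean = filled.mean(axis=0)
    std = filled.std(axis=0)
    std[std == 0] = 1.0  # avoid dividing by zero for constant columns
    return modes, mean, std


def apply_preprocessing(X, modes, mean, std):
    """Impute and normalize any partition with training-set statistics."""
    filled = np.where(np.isnan(X), modes, X)
    return (filled - mean) / std
```

In a cross-validation loop, `fit_preprocessing` would be called once per partition on the training rows, and `apply_preprocessing` on both the training and the testing rows.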
EHG measurements models. Then, using only the EHG measurements, we predicted whether the mothers delivered preterm or at term. We used solely the first signal (s1) in the databases, in agreement with the recommendation of Garcia-Casado et al. of using simple systems for predicting preterm birth [18]. This signal measures the electric potential difference between two electrodes aligned horizontally on the abdomen, 3.5 cm above the navel, and separated by 7 cm.
We preprocessed all the EHG measurements to improve the data quality. We discarded the first minute of each recording to remove transient effects. Next, we filtered the measurements to remove baseline wander and high-frequency noise. Specifically, we filtered the recordings using a zero-phase, fourth-order Butterworth bandpass filter with cutoff frequencies of 0.05 Hz and 4 Hz. Although most uterine activity is concentrated between 0.05 Hz and 0.7 Hz, we included a higher frequency range because higher frequency components have been shown to be predictive of preterm birth [19,32]. Finally, we downsampled the measurements to 10 Hz to improve computational speed without losing information. These preprocessing operations are represented using a block with a black outline at the bottom of Fig 1.
Next, as illustrated in the bottom part of Fig 1, we transformed the preprocessed EHG measurements to the time-frequency domain, using the short-time Fourier transform (STFT). We used the STFT following the positive results previously reported using this transformation for predicting preterm births from EHG measurements [19,27]. The STFT usefully represents how the spectral components of the measurements change over time by constructing a matrix where each column corresponds to a sliding time interval and contains the estimated spectral content of the measurements during the corresponding time interval. This transformation is helpful in analyzing non-stationary processes, such as the contractile activity during the recordings. We estimated the STFT using Hamming windows of 60 s that were slid using a 75% overlap. We chose this configuration since uterine contractions usually last around one minute and because this configuration resulted in satisfactory temporal and spectral resolutions based on visual inspection [33].
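The preprocessing and STFT steps can be approximated with SciPy as in the sketch below. Several details are assumptions rather than facts from the text: the raw sampling rate (`FS_RAW = 20` Hz), the reading of "fourth-order" as an order-2 band-pass design (which yields a fourth-order filter), downsampling by keeping every second sample, and the use of the STFT magnitude as the network input. The function names are hypothetical, and the study itself used MATLAB.

```python
import numpy as np
from scipy import signal

FS_RAW = 20.0  # assumed sampling rate of the raw recordings, in Hz
FS_OUT = 10.0  # target rate stated in the text, in Hz


def preprocess(x, fs=FS_RAW):
    """Drop the first minute, band-pass filter (zero phase), downsample."""
    x = np.asarray(x, dtype=float)[int(60 * fs):]
    # An order-2 band-pass design yields a fourth-order Butterworth filter;
    # sosfiltfilt runs it forward and backward, so the phase is zero.
    sos = signal.butter(2, [0.05, 4.0], btype="bandpass", fs=fs, output="sos")
    x = signal.sosfiltfilt(sos, x)
    step = int(round(fs / FS_OUT))
    return x[::step]  # content above 4 Hz was already attenuated


def stft_magnitude(x, fs=FS_OUT, window_s=60.0, overlap=0.75):
    """STFT with 60 s Hamming windows and 75% overlap (magnitude only)."""
    nperseg = int(window_s * fs)
    noverlap = int(nperseg * overlap)
    freqs, times, Z = signal.stft(
        x, fs=fs, window="hamming", nperseg=nperseg,
        noverlap=noverlap, boundary=None, padded=False,
    )
    return freqs, times, np.abs(Z)
```

With these settings, a 30-minute recording yields a matrix with one column every 15 s and a frequency resolution of 1/60 Hz.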
We predicted the pregnancies' outcomes from EHG recordings using a deep neural network, rather than using handcrafted features, because neural networks automatically learn the most informative features from the data [34,35]. Given the limited success of various methods designed to predict preterm births from the EHG measurements in the TPEHG DB using handcrafted features, Vandewiele et al. suggested using deep learning to achieve better results [22].
In agreement with the suggestions from Vandewiele et al., we used a deep recurrent neural network (RNN) to predict the pregnancies' outcomes from EHG measurements, developing a dedicated network architecture for this task. This RNN uses the training set, consisting of data samples labeled with their respective pregnancy outcome, to learn features from the input data that predict the pregnancies' outcomes. The RNN consists of a series of layers that are trained to learn multiple abstractions of the data that are helpful in relating the input data to the predictions [34]. The first layer in our network is a sequence input layer that rearranges the matrices of STFTs so that the columns of the STFT matrices, which capture the spectral content of the measurements during the sliding time intervals, become a set of features for the corresponding time step in the RNN. This input layer feeds into a series of bidirectional long short-term memory (BiLSTM) cells with 100 hidden states. The BiLSTM cells are able to learn patterns from sequential data: in our case, these cells are intended to learn patterns from the spectral changes of the EHG measurements over time. Similar network architectures, using long short-term memory (LSTM) and BiLSTM cells, have been used to successfully learn informative data representations from STFTs in other applications [36,37].
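To make the data flow concrete, the NumPy sketch below runs an STFT matrix through an untrained bidirectional LSTM and returns the concatenated final hidden states. It is a shape-level illustration only: the weights are random, the gate ordering and initialization scale are assumptions, and the function names are hypothetical. It is not the trained network from the study.

```python
import numpy as np


def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))


def lstm_last_state(seq, W, U, b):
    """Run one LSTM direction over `seq` (time steps x features) and return
    the final hidden state. Assumed gate order: input, forget, cell, output."""
    n_hidden = U.shape[1]
    h = np.zeros(n_hidden)
    c = np.zeros(n_hidden)
    for x in seq:
        i, f, g, o = np.split(W @ x + U @ h + b, 4)
        c = sigmoid(f) * c + sigmoid(i) * np.tanh(g)
        h = sigmoid(o) * np.tanh(c)
    return h


def init_bilstm(n_features, n_hidden, rng, scale=0.1):
    """Random (untrained) parameters for the forward and backward passes."""
    def one_direction():
        return (scale * rng.standard_normal((4 * n_hidden, n_features)),
                scale * rng.standard_normal((4 * n_hidden, n_hidden)),
                np.zeros(4 * n_hidden))
    return {"fwd": one_direction(), "bwd": one_direction()}


def bilstm_features(stft_mag, params):
    """Each STFT column becomes one time step; the output concatenates the
    final states of the forward and the time-reversed passes."""
    seq = np.asarray(stft_mag).T  # (time steps, frequency bins)
    h_fwd = lstm_last_state(seq, *params["fwd"])
    h_bwd = lstm_last_state(seq[::-1], *params["bwd"])
    return np.concatenate([h_fwd, h_bwd])
```

With 100 hidden units per direction, the layer outputs 200 values per recording, regardless of the number of STFT columns.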
Next, using a similar approach as Zhu et al., we connected the last BiLSTM cell to a fully connected layer consisting of two neurons, and finally we connected the fully connected layer to an output layer [36]. The fully connected layer encodes the data abstraction inferred by the BiLSTM cells into a pair of scalar values, which are then used by the output layer to make a prediction. In the classification model, this pair of scalar values scores the association of each EHG recording to the preterm and term categories. This architecture is illustrated in Fig 1.
We used two different output layers, depending on whether we intended to predict the categorical outcome of the pregnancy or to predict the gestational age at birth. For the classification problem, we used a softmax output layer and trained the network using a weighted cross-entropy loss function that penalized errors in the preterm birth predictions more. We determined the weights of the loss function based on the relative frequency of each class in the training set, a strategy that addresses the class-imbalance problem of predicting preterm births. Namely, because term labors are more frequent in the general population and in the database, classification models trained on these data are naturally biased towards predicting term labors and may learn to predict term labors for every input. This loss function is given by:

$$\mathcal{L} = -\frac{1}{N} \sum_{n=1}^{N} \left[ w_1 T_n \log y_n + w_0 \left(1 - T_n\right) \log\left(1 - y_n\right) \right],$$

where $N$ is the number of samples in each training batch, $w_i$ is the penalization weight of class $i \in \{0, 1\}$, $T_n \in \{0, 1\}$ is the label of sample $n$, and $y_n$ is the output score of sample $n$ for class 1. We set the penalization weight of class $i$ based on $S_i$, the number of samples from class $i$ in the training set, as suggested in [38].
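A weighted cross-entropy of the kind described above can be sketched in NumPy as follows. The exact weight formula is not reproduced here, so the inverse-frequency form `w_i = N / (2 * S_i)` used in `class_weights` is an assumption chosen only because it gives the rarer class the larger weight; the function names are likewise hypothetical.

```python
import numpy as np


def softmax(scores):
    """Row-wise softmax over the two outputs of the fully connected layer."""
    shifted = scores - scores.max(axis=1, keepdims=True)
    e = np.exp(shifted)
    return e / e.sum(axis=1, keepdims=True)


def class_weights(labels):
    """Assumed inverse-frequency weights: w_i = N / (2 * S_i)."""
    counts = np.bincount(labels, minlength=2)
    return labels.size / (2.0 * counts)


def weighted_cross_entropy(scores, labels, weights):
    """Mean over the batch of -w[label] * log(probability of the label)."""
    probs = softmax(scores)
    picked = probs[np.arange(labels.size), labels]
    return float(np.mean(-weights[labels] * np.log(picked)))
```

With this weighting, an error on a preterm sample in a batch with four term samples and one preterm sample costs four times as much as an error on a term sample.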
For the regression problem, we used a regression layer as the output of the network. This layer implements a mean squared error (MSE) loss function to train the network. Since the regression models are trained on a continuous output, i.e., the gestational age at birth, these models are less sensitive to the class imbalance problem. The classification and regression output layers are represented by a single blue block with a black outline at the bottom of Fig 1.
In developing our prediction models using EHG measurements, we evaluated alternative model designs based on a single run of a five-fold cross-validation. We evaluated alternative time-frequency representations, namely wavelet transforms and the empirical mode decomposition, as described in [39,40]. Additionally, we tested other neural network architectures, namely using long short-term memory (LSTM) cells and convolutional neural networks (CNN), with varying network parameters, such as the numbers of layers and the number of LSTM cells. Here, we report the model that produced the best prediction results.
We also fine-tuned the learning parameters based on a single run of a five-fold cross-validation. Namely, we selected an appropriate mini-batch size, number of training iterations, learning rate, and regularization hyperparameter.
Combined models. We developed both a classification and a regression model that combine clinical information with EHG measurements to predict pregnancies' outcomes. We hypothesized that combining all the available information can improve the performance of our models, as previously suggested [18]. We first trained the network described in the previous subsection. Then, we extracted the activation values of the fully connected layer and concatenated these values with the clinical information. Next, we used the combined data to train the logistic regression model to predict the outcome of the pregnancy, and the linear regression model to predict the gestational age at delivery. We implemented these logistic and linear regression models as described before. The difference between these models and those used for predictions based only on clinical information is that, in this case, the data vectors included the activations of the fully connected layers in addition to the clinical information. The stages of these classification and regression models, which combine clinical information and EHG measurements, are illustrated in the middle part of Fig 1.
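The combination step amounts to concatenating, for each recording, the two activations of the fully connected layer with the preprocessed clinical predictors, and then fitting a lasso-regularized logistic regression on the joined vectors. The sketch below illustrates this with a plain proximal-gradient (soft-thresholding) solver, which is swapped in here for whatever solver the original MATLAB implementation used; the hyperparameter values and function names are assumptions.

```python
import numpy as np


def combine(fc_activations, clinical):
    """Join network activations and clinical predictors column-wise."""
    return np.hstack([fc_activations, clinical])


def fit_lasso_logistic(X, y, lam=0.05, lr=0.1, n_iter=2000):
    """L1-regularized logistic regression by proximal gradient descent."""
    n, d = X.shape
    w = np.zeros(d)
    b = 0.0
    for _ in range(n_iter):
        p = 1.0 / (1.0 + np.exp(-(X @ w + b)))
        w = w - lr * (X.T @ (p - y) / n)
        # Soft-thresholding: shrinks weights and sets small ones to zero.
        w = np.sign(w) * np.maximum(np.abs(w) - lr * lam, 0.0)
        b = b - lr * np.mean(p - y)
    return w, b


def predict_proba(X, w, b):
    """Predicted probability of the positive class."""
    return 1.0 / (1.0 + np.exp(-(X @ w + b)))
```

Because the penalty acts on the size of each coefficient, the columns passed to `fit_lasso_logistic` should already be normalized with training-set statistics.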

Cross-validation
We evaluated the performance of our models using a stratified, five-fold cross-validation. We partitioned the data into a training set, containing 80% of the data, and a test set, containing the remaining 20% of the data, so that both the training and testing sets included the same proportion of preterm samples. We illustrate our data partitioning in S3 Fig. We used the training set to train our models and the testing set to evaluate the models' performance. We repeated this process five times, each time using a different set of samples for the training and testing set, so that all the samples were used for testing throughout the five runs. This cross-validation routine is indicated in Fig 1 by a dotted black outline. This outline symbolizes that the operations represented within are applied separately for each partition of the data, whereas the operations represented outside the outline are applied once for all the data.
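A stratified five-fold partition can be produced by shuffling the indices of each class separately and dealing them out to the folds in turn. The following Python sketch shows one way to do this; the function name and the seeding scheme are assumptions, and library routines for the same purpose exist in most machine learning toolkits.

```python
import numpy as np


def stratified_folds(labels, n_folds=5, seed=0):
    """Yield (train_idx, test_idx) pairs in which every fold holds about
    the same proportion of each class."""
    rng = np.random.default_rng(seed)
    labels = np.asarray(labels)
    fold_of = np.empty(labels.size, dtype=int)
    for cls in np.unique(labels):
        idx = rng.permutation(np.flatnonzero(labels == cls))
        fold_of[idx] = np.arange(idx.size) % n_folds  # deal out in turn
    for k in range(n_folds):
        yield np.flatnonzero(fold_of != k), np.flatnonzero(fold_of == k)
```

Each sample appears in exactly one test fold, so the five runs together test every sample once.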

Statistical analysis
To evaluate the performance of the prediction models with confidence intervals and to reduce the risk of bias, we repeated the cross-validation routine 20 times, as recommended in [41]. Each time, we used a different random partition of the data. By repeating the cross-validation routine with various random partitions, we prevented our models from possibly producing over-optimistic results due to fitting of the training hyperparameters and model specifications to a specific cross-validation partition. We then calculated the mean and 95% confidence interval (CI) of the performance statistics, assuming that the performance statistics had Gaussian distributions with unknown means and variances.
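Under the stated Gaussian assumption with unknown mean and variance, the confidence interval of the mean follows a Student's t distribution with one degree of freedom fewer than the number of repetitions. The sketch below shows this calculation; the function name is hypothetical and the input values in the usage note are made up for illustration.

```python
import math
import statistics

from scipy import stats


def mean_and_ci(values, level=0.95):
    """Mean and two-sided confidence interval of the mean, assuming
    Gaussian samples with unknown mean and variance (Student's t)."""
    n = len(values)
    mean = statistics.fmean(values)
    sem = statistics.stdev(values) / math.sqrt(n)  # standard error
    t_crit = stats.t.ppf(0.5 + level / 2.0, df=n - 1)
    return mean, (mean - t_crit * sem, mean + t_crit * sem)
```

For example, `mean_and_ci([0.70, 0.72, 0.74, 0.76, 0.78])` returns a mean of 0.74 with an interval of roughly 0.70 to 0.78; with 20 repetitions the interval would use 19 degrees of freedom.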

Performance of the prediction models
First, we attempted to predict preterm births by using only the clinical information, which supplements the EHG measurements and is described in Table 1. We developed two models: a logistic regression model to determine whether a pregnancy would result in a preterm birth, and a linear regression model to predict the gestational age at delivery, as detailed in Methods. When using the regression model, we predicted that a birth would be at term if the estimated gestational age at delivery was at least 37 completed weeks, or 259 days. The classification model predicted preterm births with an area under the receiver-operating characteristic curve (AUC) of 0.65 (95% CI: 0.63-0.67), whereas the regression model predicted preterm births with an AUC of 0.67 (95% CI: 0.65-0.70).
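One way to score a regression model with an AUC is to treat a lower predicted gestational age as stronger evidence of a preterm birth, as in the sketch below. Whether the study computed the regression AUCs in exactly this way is an assumption; the function names are hypothetical, and the AUC is computed directly from its pairwise definition.

```python
import numpy as np

PRETERM_DAYS = 259  # 37 completed weeks


def auc(scores, is_positive):
    """Probability that a random positive outscores a random negative
    (ties count one half). Requires at least one sample of each class."""
    scores = np.asarray(scores, dtype=float)
    is_positive = np.asarray(is_positive, dtype=bool)
    pos, neg = scores[is_positive], scores[~is_positive]
    greater = (pos[:, None] > neg[None, :]).sum()
    ties = (pos[:, None] == neg[None, :]).sum()
    return (greater + 0.5 * ties) / (pos.size * neg.size)


def regression_to_labels(predicted_days):
    """Label a prediction as preterm if it falls below 259 days."""
    return np.asarray(predicted_days) < PRETERM_DAYS


def regression_auc(predicted_days, true_days):
    """AUC for preterm detection, using the negated prediction as score."""
    truth = np.asarray(true_days) < PRETERM_DAYS
    return auc(-np.asarray(predicted_days, dtype=float), truth)
```

The AUC depends only on how the predictions rank the mothers, so a regression model with a large error in days can still reach a reasonable AUC.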
Next, we examined whether EHG measurements could be used to predict preterm births using end-to-end deep-learning models, directly from EHG measurements and without requiring handcrafted features. Specifically, we trained a recurrent neural network to predict whether the pregnant mothers would deliver preterm and to predict their gestational ages at delivery, as described in Methods. This network's predictions surpassed those of the clinical information models. The classification model trained on EHG measurements was able to predict preterm births with an AUC of 0.74 (95% CI: 0.73-0.76), whereas the regression model predicted preterm births with an AUC of 0.70 (95% CI: 0.68-0.73).
We also developed models to predict preterm births based on clinical information combined with EHG measurements, as described in Methods. We hypothesized that integrating the clinical information and the EHG measurements would yield more accurate prediction models, because the models trained independently on clinical information alone and EHG measurements alone could predict preterm births better than random guessing. Moreover, the clinical information and the EHG measurements provide complementary information about the pregnancy. Consistent with our hypothesis, the prediction models trained on both clinical information and EHG measurements slightly outperformed the models trained on clinical information alone and on EHG measurements alone. Our classification model predicted preterm births with an AUC of 0.78 (95% CI: 0.76-0.80), and the regression model predicted preterm births with an AUC of 0.75 (95% CI: 0.73-0.77).
To better evaluate the performance of our prediction models, we estimated a performance bound on this classification problem. In our work, as well as in the obstetrics literature and clinical practice, births are considered preterm if the mother delivers the fetus before completing 37 weeks of gestation. However, the gestational age of the mother has an uncertainty that depends on the method used to estimate it. Generally, gestational age is estimated based on a first trimester ultrasound examination or on the timing of the last menstrual period (LMP) [42]. When the gestational age is estimated based on early ultrasound examination, the estimate has a standard deviation of about five days, whereas estimates based on the LMP have standard deviations of about seven days [43]. Notably, the incidence of preterm births depends on the method used to estimate the gestational age [44].
This estimation error translates into uncertainty in the ground truth labels and limits the possible performance of classification algorithms. We estimated the upper bound of the AUC due to this limitation by measuring the AUC obtained when predicting the gestational age at delivery using a noisy version of the true gestational ages at delivery. We corrupted the gestational ages at delivery by adding independent and identically distributed (i.i.d.) Gaussian noise with zero mean and a standard deviation of six days. After repeating this procedure 20 times to estimate the mean and 95% CI of this AUC using this approach, we found that the upper AUC bound for this classification problem is 0.98 (95% CI: 0.98-0.98).
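The bound estimate can be reproduced in outline by corrupting the true gestational ages with Gaussian noise and scoring the noisy ages against the true labels. The sketch below follows that recipe; the function names and the evenly spaced ages in the usage note are assumptions for illustration, so it will not reproduce the reported value unless it is run on the actual gestational ages.

```python
import numpy as np


def auc(scores, is_positive):
    """Pairwise AUC: probability that a positive outscores a negative."""
    scores = np.asarray(scores, dtype=float)
    is_positive = np.asarray(is_positive, dtype=bool)
    pos, neg = scores[is_positive], scores[~is_positive]
    greater = (pos[:, None] > neg[None, :]).sum()
    ties = (pos[:, None] == neg[None, :]).sum()
    return (greater + 0.5 * ties) / (pos.size * neg.size)


def noisy_label_auc(true_days, noise_sd_days=6.0, n_repeats=20, seed=0):
    """AUC obtained when the 'prediction' is the true gestational age at
    delivery plus i.i.d. zero-mean Gaussian noise, repeated n_repeats times."""
    rng = np.random.default_rng(seed)
    true_days = np.asarray(true_days, dtype=float)
    preterm = true_days < 259  # fewer than 37 completed weeks
    aucs = []
    for _ in range(n_repeats):
        noisy = true_days + rng.normal(0.0, noise_sd_days, true_days.size)
        aucs.append(auc(-noisy, preterm))  # earlier birth = higher score
    return np.array(aucs)
```

For instance, `noisy_label_auc(np.linspace(210, 290, 159)).mean()` gives the bound for a hypothetical cohort with evenly spaced delivery dates.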
In Fig 2, we present the receiver-operating characteristic curves (ROC) for the classification and regression models trained on clinical information alone, EHG measurements alone, and clinical information combined with EHG measurements. We observe that the classification models which leverage EHG measurements outperform the regression models trained on the same data. Moreover, we notice that regardless of whether we use the classification or regression approach, the EHG-based models outperform the clinical information-based models and that the models that leverage both the clinical information and the EHG measurements achieve the best performance.
To further assess the performance of our models, we measured the sensitivity, positive predictive value (PPV), and negative predictive value (NPV) at various specificity levels, as shown in Table 2 [45]. We include the PPV and NPV in our analysis because these statistics consider the incidence of preterm births in the dataset [45]. Since the classification models that use EHG measurements outperformed the regression models, we present the results only for the classification models.
In Table 2, we observe that the combined model outperforms the models trained on clinical information alone or EHG measurements alone in sensitivity, PPV, and NPV at various specificity levels. Moreover, we observe that our models have a much higher NPV than PPV, which results from the low incidence of preterm births. In other words, our predictions of term births are more reliable than our predictions of preterm births.
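The statistics discussed above can be computed by choosing, for a given target specificity, the lowest score threshold that still meets the target and then reading the remaining values off the confusion matrix. The Python sketch below illustrates this; the threshold-selection rule and the function name are assumptions, since the text does not state how the operating points were chosen.

```python
import numpy as np


def metrics_at_specificity(scores, is_preterm, target_specificity):
    """Sensitivity, PPV and NPV at the lowest threshold whose specificity
    reaches the target. Both classes must be present in `is_preterm`."""
    scores = np.asarray(scores, dtype=float)
    is_preterm = np.asarray(is_preterm, dtype=bool)
    for threshold in np.unique(scores):  # ascending order
        predicted = scores > threshold
        tp = int(np.sum(predicted & is_preterm))
        fp = int(np.sum(predicted & ~is_preterm))
        tn = int(np.sum(~predicted & ~is_preterm))
        fn = int(np.sum(~predicted & is_preterm))
        if tn / (tn + fp) >= target_specificity:
            return {
                "threshold": float(threshold),
                "specificity": tn / (tn + fp),
                "sensitivity": tp / (tp + fn),
                "ppv": tp / (tp + fp) if tp + fp else float("nan"),
                "npv": tn / (tn + fn),
            }
    raise ValueError("scores must not be empty")
```

Because PPV and NPV depend on the class proportions, a dataset with few preterm births tends to show a high NPV and a low PPV even for a useful classifier.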
We verified that our model was not discriminating between the two datasets used in our work. The TPEHG DB and the TPEHGT DS datasets were acquired with the same device and following the same protocol, so we did not expect that our model would discriminate between the samples of either dataset. We confirmed that our model does not assign one label to the samples from one dataset and another label to the samples of the other dataset. Moreover, when we trained the classification models using only the TPEHG DB, we obtained similar AUCs to those obtained when we trained the models using data from both datasets.
Although our regression models could predict preterm births more accurately than random guessing, these models were not able to predict the gestational ages at delivery with a much lower MSE than the MSE obtained using the mean gestational age at delivery in the training set, i.e., the minimum MSE estimator. Although the correlation between the predicted and true gestational ages at delivery is positive, the accuracy of the predictions is low, as shown in S4 Fig.

Predictive components of EHG measurements
Further, we investigated how various components of the EHG measurements contribute to the preterm birth predictions by altering the STFT representations of the data. We first explored the predictive power of various frequency bands, as shown in Fig 3a and 3b. We extracted four frequency bands (B0 through B3) by using only the relevant rows of the STFT for training and testing. We considered frequency bands similar to those of Jager et al.: in our case, B0, B1, B2, and B3 cover the frequency ranges between 0.05 Hz and 1.0 Hz, 1.0 Hz and 2.2 Hz, 2.2 Hz and 3.5 Hz, and 3.5 Hz and 5.0 Hz, respectively [19]. The only difference between our spectral partition and that proposed by Jager et al. is that, in our case, the lower frequency cutoff of B0 is 0.05 Hz instead of 0.08 Hz [19]. According to Jager et al., B0 mostly contains electrical activity associated with uterine contractions, whereas the higher bands contain harmonic frequencies of uterine reverberation caused by maternal cardiac activity [19]. Notably, we observed that the models trained on higher frequency bands achieved higher AUCs, as shown in Fig 3b.
Next, we examined how the temporal patterns of the measurements contribute to the models' predictions. We disrupted the temporal patterns by randomly rearranging a random subset of columns of the STFTs, as illustrated in Fig 3c. The AUC of the model did not significantly change as larger fractions of columns of the STFTs were rearranged, as shown in Fig 3d. Notably, when all the columns of the STFTs were randomly rearranged, i.e., when all the temporal patterns were disrupted, our classification model trained on disrupted EHG measurements alone was able to predict preterm births with an AUC of 0.74 (95% CI: 0.72-0.76).
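Both manipulations act directly on the STFT matrix: band selection keeps a subset of rows, and temporal disruption permutes a subset of columns. The sketch below shows one possible implementation. The half-open band edges (lower edge included, upper edge excluded, so a bin at exactly 5.0 Hz is dropped) and the function names are assumptions.

```python
import numpy as np

BANDS_HZ = {
    "B0": (0.05, 1.0),
    "B1": (1.0, 2.2),
    "B2": (2.2, 3.5),
    "B3": (3.5, 5.0),
}


def select_band(stft_mag, freqs, band):
    """Keep only the STFT rows whose frequency lies in [low, high)."""
    low, high = BANDS_HZ[band]
    rows = (freqs >= low) & (freqs < high)
    return stft_mag[rows]


def shuffle_columns(stft_mag, fraction, rng):
    """Randomly rearrange a random subset of columns among themselves;
    the remaining columns keep their original positions."""
    n_cols = stft_mag.shape[1]
    n_move = int(round(fraction * n_cols))
    chosen = rng.choice(n_cols, size=n_move, replace=False)
    order = np.arange(n_cols)
    order[chosen] = rng.permutation(chosen)
    return stft_mag[:, order]
```

Shuffling leaves the spectral content of every column intact and only changes the order in which the network sees the columns.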
Based on our observations from disrupting the spectral and temporal patterns, we hypothesized that the predictions of our model are guided more by the spectral composition of the measurements than by their temporal patterns. Hence, we sought to predict preterm births using shorter EHG recordings. The duration of EHG recordings, usually between 30 and 60 minutes, is an important hindrance to their implementation in clinical settings, where personnel resources are often limited [23,46].
To test this hypothesis, we trained and tested our model using cropped STFTs, as shown in Fig 3e. We removed columns at the beginning and at the end of the STFTs to simulate shorter EHG measurements. Since the initial point selected for these shortened STFTs slightly affects the resulting AUC, we selected a random initial point for each shortened sample. Remarkably, the performance of our model decreased only marginally with decreasing measurement duration, as shown in Fig 3f. When we trained our model using one-minute-long recordings, we could predict preterm births with an AUC of 0.71 (95% CI: 0.69-0.73), which is only slightly lower than the 0.74 (95% CI: 0.73-0.76) AUC we obtained using the entire 30-minute-long recordings.
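The cropping procedure can be illustrated with the short sketch below, which keeps a contiguous window of time frames starting at a random position. It is a sketch under stated assumptions: the STFT is a list of frequency rows, the target duration is expressed directly as a number of frames (`n_frames`) because the conversion from minutes to frames depends on STFT parameters not restated here, and the function name is hypothetical.

```python
import random


def random_crop(stft, n_frames, rng):
    """Return a contiguous block of `n_frames` columns with a random start."""
    n_cols = len(stft[0])
    if n_frames > n_cols:
        raise ValueError("crop is longer than the recording")
    start = rng.randrange(n_cols - n_frames + 1)
    # Dropping the columns before `start` and after the window simulates
    # a shorter recording taken from the middle of the original one.
    return [row[start:start + n_frames] for row in stft]


# Toy STFT with three frequency bins and twenty time frames (illustrative values).
stft = [[100 * r + c for c in range(20)] for r in range(3)]
short = random_crop(stft, 5, random.Random(0))
```

Drawing a new start position for each sample averages out the small dependence of the AUC on where the shortened window begins.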

Discussion
We developed a deep learning method to predict preterm births from EHG measurements and clinical information obtained from two public databases. We predicted preterm births with good accuracy directly from the data and without using handcrafted features, manual annotations, or simultaneous tocography measurements. Thus, our method potentially enables automatic prediction of preterm births from EHG recordings.
To assess the performance of our method from the perspective of clinical practice, we compared the performance of our method with the performances reported for other technologies and methods to predict preterm births, as shown in Table 3. For this comparison, we included results only from studies published in peer-reviewed journals, with sound methodology, that reported the AUC of the predictions, and which included at least 50 pregnant mothers. Similarly to the datasets used in this work, the results reported in these studies correspond to obstetric populations excluding medically induced births. However, whereas many of these studies included either mothers with or without symptoms of preterm labor, the TPEHG DB and the TPEHGT DS contain EHG recordings obtained both during regular checkups and from mothers hospitalized with symptoms of preterm labor. Unfortunately, we could not distinguish EHGs based on whether the mothers had symptoms of preterm labor during the recordings because this information is not provided in the datasets.
From this comparison, we observe that the performance of our method is superior to the performance of existing methods to predict preterm births that take place before 37 complete weeks of gestation. Importantly, our method outperforms the gold standard biomarkers of preterm birth, i.e., cervical length and fibronectin alpha, in this task. Moreover, the performance of our method in predicting preterm births in mothers around their 31st week of gestation is relatively close to the performance of the gold standard tests in predicting preterm birth within only one week in mothers with symptoms of preterm labor. Our results support previous findings suggesting that preterm birth can be predicted by using EHG measurements from around the 31st week of gestation [16,23].
Additionally, we investigated how the temporal and spectral components of the EHG measurements contribute to our model's predictions. We observed that the higher frequency components of the EHG measurements are more predictive of preterm births. A possible explanation for this phenomenon is that the higher frequency bands contain spectral harmonics of the electrical activity in EHG measurements and more spectral information may be coded in the higher frequency bands. However, further research is needed to decipher the sources of the various spectral components of EHG measurements.
Importantly, we observed that the temporal patterns in EHG measurements are not crucial to predicting preterm births. This observation agrees with the results published by Iams et al., who showed that the frequency of uterine contractions is not predictive of preterm births [49]. Moreover, this observation might also explain the inability of tocography, which measures the temporal patterns of uterine contractions, to predict preterm births [13]. Inspired by this observation, and specifically by our results presented in Fig 3d, we explored whether we could use shorter EHG measurements to predict preterm births.
Notably, we found that shortening the EHG measurements did not substantially degrade the performance of our model in predicting preterm births. This finding suggests that shorter EHG recordings could be sufficient to predict preterm births. From the perspective of clinical adoption, a shorter recording is easier for the users and reduces cost [18]. Moreover, the shortened recording time combined with the automaticity of our method facilitates at-home implementations.
Whereas the classification and regression models could predict preterm births with good accuracy, surprisingly, the regression models could not predict the gestational ages at delivery accurately, as shown in S4 Fig. This effect can be explained by the pathology of preterm births and by analyzing the distribution of the gestational ages at delivery. Preterm birth is an abnormal physiological condition, not just a pregnancy that happened to end early. Therefore, we can expect that physiological measurements, such as EHG recordings, may show a stronger dichotomy between pregnancies that end with either preterm or term deliveries than is shown in continuous characteristics correlated with gestational age at delivery.
We observe the dichotomous aspect of preterm and term births through the distribution of the gestational ages at delivery, shown in S1 and S2 Figs. The distribution of the gestational ages at birth of the mothers included in this work only from the TPEHG DB is left-skewed and does not appear to follow a Gaussian distribution, as shown in panel d of S2 Fig. This skewness may be caused either by an excess of preterm births compared to what would be expected if the gestational ages at birth followed a Gaussian distribution or by the induction of postterm births, which can skew the distribution towards earlier deliveries. However, when we exclude the preterm births, the distribution of gestational ages at birth appears to follow a Gaussian distribution, as shown in panel h of S2 Fig. This observation suggests that the skewness results from an over-representation of preterm births rather than from the induction of postterm births. Since the gestational ages at delivery do not follow a Gaussian distribution in which the left tail accounts for preterm births, we suggest that the dynamics that dictate the gestational age at delivery do not follow a continuum between preterm and term births. Therefore, we propose that predicting the gestational age at delivery is more complicated than predicting preterm births using categorical outputs.
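As an illustration of how left-skewness can be quantified, the sketch below computes the moment-based sample skewness of a set of gestational ages, which is negative for a left-skewed distribution and close to zero for a symmetric one. The function name and the gestational ages are hypothetical and are not taken from the datasets; the sketch does not reproduce the analysis behind S2 Fig.

```python
from statistics import fmean


def sample_skewness(values):
    """Moment-based skewness: third central moment over variance to the 3/2."""
    mean = fmean(values)
    m2 = fmean((v - mean) ** 2 for v in values)
    m3 = fmean((v - mean) ** 3 for v in values)
    return m3 / m2 ** 1.5


# Hypothetical gestational ages at delivery in weeks (illustrative only).
all_births = [30.0, 36.0, 38.0, 39.0, 39.0, 40.0, 40.0, 41.0]
term_only = [age for age in all_births if age >= 37.0]

print("skewness, all births:", sample_skewness(all_births))
print("skewness, term births only:", sample_skewness(term_only))
```

Comparing the skewness with and without the preterm births mirrors the comparison between panels d and h described above.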
The significance of predicting preterm births several weeks before delivery is that it can be helpful in delaying preterm births and improving their outcomes. For example, clinical providers can prescribe progesterone to these mothers to prolong their pregnancies [50,51]. Additionally, medical providers could more frequently screen mothers at high risk of preterm birth to identify and treat hypertensive disease and cervical insufficiency [52,53]. Moreover, anticipating preterm births can be useful in planning for the birth to take place at a hospital with a neonatal intensive care unit (NICU), rather than at home, in birthing centers, or in hospitals without a NICU, thus avoiding ambulance transport and admission delays and improving outcomes [54][55][56]. Furthermore, identifying mothers at high risk of preterm birth may help researchers assess the efficacy of potential approaches and therapies to delay preterm births and improve their outcomes.
Although machine learning algorithms can contribute to improving healthcare and much research is yielding advances in this field, important challenges remain [57,58]. For example, machine learning predictions usually lack interpretability, meaning that it is challenging to identify the causes justifying the algorithms' predictions [57,58]. In our case, although our predictions could influence pregnancy management, our predictions would need to be supplemented with additional medical examinations to determine which therapies are more likely to reduce the risk of preterm birth and improve its outcomes. Additionally, machine learning algorithms in healthcare settings need to be carefully developed to protect data privacy and to prevent social biases from driving the predictions [58][59][60].
Despite the limitations of machine learning algorithms for developing medical devices, the number of medical products based on machine learning is steadily increasing thanks to their good performance [61]. By predicting preterm births with good accuracy directly from the measurements, while avoiding data leakage, our work is a step forward towards developing a medical device for predicting preterm births from EHG measurements using deep learning.
Our work is limited by the etiology of preterm birth and the dataset that we used to develop our models. Because preterm birth is a syndrome with many causes, it is most likely that no single physiological measurement will predict preterm births with perfect or nearly perfect accuracy [5,62]. A combination of measurements of various physiological processes is likely to produce better results [5,63].
Our work is also limited by the small size of the datasets employed. We evaluated our prediction models using cross-validation rather than separating a subset of the data exclusively for testing after developing our models, because such a testing set would be too small for accurate performance evaluation [41]. For example, if we set apart 20% of the data for the final testing, this dataset would contain five preterm and 25 term samples. Moreover, all the samples in the dataset were acquired in a single hospital, and thus our model may not generalize well to measurements from mothers in different populations. Additionally, the datasets used in our work and in those mentioned in Table 3 excluded medically induced births, and therefore these populations may differ from general obstetric populations. However, Erkamp et al. found similar screening performance for preterm birth using sonographic measurements when either including or excluding medically induced births from their analysis [47]. Furthermore, we used a dataset with a larger proportion of preterm births than the general population due to the inclusion of the TPEHGT DS, which has the same number of term and preterm samples. This overrepresentation of preterm births can bias our results with respect to the expected performance in the general population, especially affecting the PPV and NPV, which depend on the incidence of preterm births. A larger database, preferably acquired across multiple healthcare centers, could rectify these limitations. Specifically, a larger database would enable us to separate a subset of samples to further evaluate the generalizability of our model. Moreover, because of the limited size of the database, we trained a small neural network with a limited number of parameters. In the future, a larger database would also enable us to train larger and more complex prediction models for better results [64].
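The dependence of the PPV and NPV on the incidence of preterm births follows from Bayes' rule, as the minimal sketch below illustrates. The sensitivity, specificity, and prevalence values are hypothetical and are not results from this work; they only show the direction of the effect when the same classifier is applied to populations with different proportions of preterm births.

```python
def ppv(sensitivity, specificity, prevalence):
    """Positive predictive value from Bayes' rule."""
    true_pos = sensitivity * prevalence
    false_pos = (1.0 - specificity) * (1.0 - prevalence)
    return true_pos / (true_pos + false_pos)


def npv(sensitivity, specificity, prevalence):
    """Negative predictive value from Bayes' rule."""
    true_neg = specificity * (1.0 - prevalence)
    false_neg = (1.0 - sensitivity) * prevalence
    return true_neg / (true_neg + false_neg)


# Hypothetical operating point and prevalences (illustrative only).
SENSITIVITY, SPECIFICITY = 0.7, 0.7
for prevalence in (0.1, 0.2, 0.5):
    print(
        prevalence,
        round(ppv(SENSITIVITY, SPECIFICITY, prevalence), 3),
        round(npv(SENSITIVITY, SPECIFICITY, prevalence), 3),
    )
```

At a fixed sensitivity and specificity, a lower prevalence yields a lower PPV and a higher NPV, which is why values estimated on a dataset enriched in preterm births do not transfer directly to the general population.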
Our work can be expanded to improve its performance and clinical value. First, following the same approach we used to combine EHG measurements with clinical data to predict preterm births, our method could incorporate other data, such as cervical length and fibronectin alpha measurements, which are likely to improve its performance. Additionally, to track the evolution of EHG activity towards birth and develop a dynamic prediction model, multiple EHG measurements could be recorded throughout pregnancy for each mother. Moreover, alternative techniques can be used for preprocessing EHG measurements that could potentially improve the performance of our prediction model [16,65]. Lastly, our work could be integrated with models connecting surface EHGs with uterine sources to include anatomical and physiological information for making predictions [66,67].

Conclusions
In summary, we developed a deep learning model to predict preterm births using clinical information and EHG measurements. Our method predicted preterm births more accurately than existing technologies. We also showed that preterm births can be predicted using short EHG recordings. Our work and results are useful for developing applications to predict preterm births early during pregnancy and for ultimately improving their outcomes.

(a) Predicted gestational ages at birth, using the clinical information alone, plotted against the true gestational ages at birth. Each blue circle shows the gestational age at delivery, predicted based on the clinical information and the true gestational age at delivery for a single mother. The solid black line represents the linear fit between the predictions and the true values. The dashed black line represents a perfect correspondence between predictions and true values. The legend shows the root mean square error (RMSE) of the predictions, the coefficient of determination (R²) of the predictions, and the slope of the linear fit. (b) Bland-Altman plot for the predicted gestational ages at birth, using the clinical information alone and the true gestational ages at birth. Each blue circle represents the difference between predicted and true gestational ages at birth, and the mean of these values. The solid and dashed black lines show the mean of the difference between the predicted and the true values, and the 95% limits of agreement, calculated as mean ± 1.96 standard deviations, respectively. (c) Similar to (a), but using the predictions based on EHG measurements alone. (d) Similar to (b), but using the predictions based on EHG measurements alone. (e) Similar to (a), but using the predictions based on clinical information combined with EHG measurements. (f) Similar to (b), but using the predictions based on clinical information combined with EHG measurements. All values are presented as mean with 95% CI. (TIF)